SUMMARY


The utilization of minimum dis- 
tance classification methods in remote 
sensing problems, such as crop 
species identification, is considered. 
Minimum distance classifiers belong to 
a family of classifiers referred to as 
sample classifiers. In such classi- 
fiers the items that are classified 
are groups of measurement vectors 
(e.g. all measurement vectors from an 
agricultural field), rather than individual
vectors as in more conventional
vector classifiers.

Specifically in minimum distance 
classification a sample (i.e. group of 
vectors) is classified into the class 
whose known or estimated distribution 
most closely resembles the estimated 
distribution of the sample to be 
classified. The measure of resemblance 
is a distance measure in the space of 
distribution functions. 

The literature concerning both
minimum distance classification problems
and distance measures is reviewed.
Minimum distance classification
problems are then categorized on the
basis of the assumption made regarding
the underlying class distribution.

Experimental results are presented 
for several examples. The objective of 
these examples is to: (a) compare the
sample classification accuracy (%
samples correct) of a minimum distance
classifier with the vector classification
accuracy (% vectors correct) of
a maximum likelihood classifier; (b)
compare the sample classification 
accuracy of a parametric with a non- 
parametric minimum distance classifier. 
For (a), the minimum distance classi- 
fier performance is typically 5 % to 
10 % better than the performance of the 
maximum likelihood classifier. For 
(b), the performance of the nonparame- 
tric classifier is only slightly 
better than the parametric version.
The improvement is so slight that the
additional complexity and slower speed
make the nonparametric classifier
unattractive in comparison with the
parametric version. In fact disparities
between training and test results suggest
that training methods are of much
greater importance than whether the
implementation is parametric or
nonparametric.


INTRODUCTION 

A fairly common objective of 
remote sensing in connection with earth 
resources is to attempt to establish 
the type of ground cover on the basis 
of the observed spectral radiance.
The examination of systems capable of
achieving this objective shows that a
certain duality of system types exists.
Landgrebe refers to the two types as
image-oriented systems and
numerically-oriented systems. The duality exists
primarily for historical reasons as a
consequence of the independent development
of photographically oriented and
computer oriented technology. The
primary distinction between the two
system types is that in image-oriented
systems a visual image is an essential
part of the analysis scheme, while in
numerically-oriented systems the
visual image plays a secondary role.
In Fig. 1 the location of the "Form
Image" block in relation to the
"Analysis" block characterizes the two
system types.

In numerically oriented remote 
sensing systems it is frequently pos- 
sible to design the data collection 
system in such a manner that classifi- 
cation becomes a problem in pattern 
recognition. This situation prevails
if one attempts to study earth resources
through the utilization of
multispectral data-images. The term
multispectral image (i.e. without the
modifier "data") is used to refer to one
or more spectrally different superimposed
pictorial images of a scene.
The modifier "data" is added to indicate
that images are stored as numerical
arrays as opposed to visual images.

To obtain a multispectral data- 
image of a scene, the scene in question 
is partitioned on a rectangular grid 
into small cells (pixels) and the 
radiance from each pixel for each wave- 
length band of interest is measured 
and stored. The set of measurements 
for a pixel constitutes the measure- 
ment vector for that pixel. A multi- 
spectral data-image for a scene is 
simply the complete collection of all 
measurement vectors for the image.
The spatial coordinates (i.e. row and
column number) of each pixel are of
course also recorded to uniquely
identify each measurement vector.
Fig. 2 depicts the situation.
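
The data-image layout just described can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the scene size (4 by 5 pixels), the number of bands (3), the random radiance values, and the helper names `measurement_vector` and `all_vectors_with_coordinates` are hypothetical and do not come from the paper.

```python
import numpy as np

# Hypothetical scene: 4 rows x 5 columns of pixels, 3 wavelength bands.
rows, cols, bands = 4, 5, 3
rng = np.random.default_rng(0)
data_image = rng.random((rows, cols, bands))  # one radiance value per pixel per band


def measurement_vector(image, row, col):
    """Return the measurement vector (one value per band) of a single pixel."""
    return image[row, col, :]


def all_vectors_with_coordinates(image):
    """Flatten a data-image into (row, col) coordinates and matching vectors."""
    n_rows, n_cols, n_bands = image.shape
    rr, cc = np.meshgrid(np.arange(n_rows), np.arange(n_cols), indexing="ij")
    coords = np.stack([rr.ravel(), cc.ravel()], axis=1)
    vectors = image.reshape(n_rows * n_cols, n_bands)
    return coords, vectors


coords, vectors = all_vectors_with_coordinates(data_image)
```

Each row of `vectors` is one measurement vector, and the matching row of `coords` holds the row and column number that identifies it.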

The methods used to generate 
multispectral data-images can conveniently
be divided into two categories.
In the first category, film is used to 
record the image. The film is subse- 
quently scanned and digitized to pro- 
duce a data-image. The multispectral 
property is obtained either by scanning 
several images photographed through 
different spectral windows and overlaying
the data; or by utilizing color
film and separating the spectral com- 
ponents during the scanning procedure. 
In the second category the image is 
generated electrically and stored in 
an electrically compatible form, 
usually on magnetic tape as either an 
analog or digital signal. The electrical
signal to be stored can be generated
by a number of different systems*;
the multispectral scanner and return
beam vidicon probably qualify as the
two most common examples. For the
scanner the multispectral property is 
obtained by filtering of the spectral 
signal collected through a single aper- 
ture prior to recording, or by the 
superposition of several unispectral 
images collected through different
apertures.

As already stated, pattern recognition
techniques can serve as the
basis for effecting classification of
multispectral data-images. Much of
pattern recognition theory is formu- 
lated in terms of multidimensional 
spaces with the dimensionality of the 
space equal to the dimensionality of 
the vectors to be classified. This 
vector dimensionality is, of course, 
determined by the number of attributes 
or properties of each pixel to be con- 
sidered in the classification (e.g. 
number of spectral bands). Classify- 
ing a multispectral data-image by 
classifying the observation vectors 
from such an image on a pixel-by-pixel 
basis falls naturally into this common 
pattern recognition framework. In con- 
trast to this vector-by-vector approach 
there are classification schemes which 
collectively will be referred to as 
"sample classification schemes". In 
such schemes all vectors to be classi- 
fied are first segregated into groups 
(i.e. samples) such that all the vec- 
tors in a group belong to the same 
class. The whole group of vectors is
then classified simultaneously. The 
minimum distance method considered is 
one such classification scheme. 

In utilizing sample classification 
schemes two distinct problems can be 
identified. The first is concerned 
with partitioning the measurement vectors
into homogeneous groups, while
the second is concerned with the 
classification of these groups. Ex- 
cept for the comments in the next 
paragraph consideration is restricted 
to the second problem. 

It frequently occurs for multi- 
spectral data-images that many of the 
adjacent measurement cells belong to 
the same class. For example, in an agricultural
scene each physical field
typically contains many pixels. In
fact it is precisely this condition 
that prompts the investigation of sam- 
ple classification schemes. In such 
situations the physical field bounda- 
ries serve to define suitable samples 
for problems like crop species identi- 
fication, and it is in this context 
that sample classifiers might also be 
referred to as per-field classifiers.
It is apparent that for the situation
just described one method of automatically
defining samples is to devise a
scheme that automatically locates physical
field boundaries in the multispectral
data-imagery. For the
minimum distance classification results
presented later, physical field boundaries
will actually be used to define
the samples, but the field boundaries
are located manually rather than automatically.
A second and perhaps more
promising approach to the problem of 
defining samples is via observation 
space clustering. In this approach 
vectors from an arbitrary area are 
clustered in the observation space, and 
all the vectors assigned to the same 
cluster constitute a sample irrespective
of their location in the arbitrarily
chosen area. In this case the term
"fields" no longer seems appropriate
and consequently the term sample class- 
ifier is preferred over the term per- 
field classifier. 


It is apparent that sample class- 
ification schemes cannot be used in all 
situations where a vector-by-vector 
approach is possible. A basic require- 
ment is that the data to be classified 
can either be segregated into homogen- 
eous samples or occur naturally in 
this form. Where the minimum distance 
scheme can be applied it intuitively 
has several potential advantages over 
a vector-by-vector classifier; in 
particular it is potentially faster 
and more accurate. 

It seems logical that provided 
the time required to automatically de- 
fine the samples is not too great, then 
sample classifiers should be faster
than a vector-by-vector classifier.
This is of considerable importance in
utilizing a numerically oriented remote
sensing system to survey earth resources
because a characteristic of
such surveys is the tremendous volume
of data involved. One would also anticipate
that the vector classification
accuracy (% vectors correctly classified)
for vector-by-vector classifiers
would be lower than the sample classification
accuracy (% samples correctly
classified) for sample classifiers.
The reason for this is that in sample
classifiers all the information conveyed
by a group of vectors is used to
establish the classification of each
vector, whereas in vector-by-vector
classifiers each vector is treated
separately without reference to any
other vector. In a sense sample classifiers
utilize spatial information because
vectors are classified as groups,
which naturally have some spatial extent.
No spatial information is used
in vector-by-vector classifiers;
consequently, sample classifiers should
perform better, since spatial information
is certainly of some value.


MINIMUM DISTANCE CLASSIFICATION 

Problem Formulation 

In a certain sense minimum dis- 
tance classification resembles what is 
probably the oldest and simplest approach
to pattern recognition, namely
"template matching". In template
matching a template is stored for each 
class or pattern to be recognized (e.g. 
letters of the alphabet) and an un- 
known pattern (e.g. an unknown letter) 
is then classified into the pattern 
class whose template best fits the un- 
known pattern on the basis of some 
previously defined similarity measure. 
In minimum distance classification the 
templates and unknown patterns are distribution
functions and the measure of
similarity used is a distance measure
between distribution functions. Thus
an unknown distribution is classified
into the class whose distribution function
is nearest to the unknown distribution
in terms of some predetermined
distance measure. In practice the distribution
functions involved are usually
not known, nor can they be observed
directly. Rather a set of random
measurement vectors from each distribution
of interest is observed and
classification is based on estimated
rather than actual distributions.

It is necessary to define more 
precisely what constitutes a suitable 
distance for minimum distance classification.
Mathematically the terms
"distance" and "metric" are used interchangeably.
For our purpose it is convenient
to distinguish between the two
terms. In essence all that is required
for a well-defined minimum distance
classification rule is a measure of
similarity between distribution functions
which need not necessarily possess
all the properties of a metric.
The term "distance" refers to any suitable
similarity measure; the term
"metric" is used in the normal mathematical
sense. More specifically a metric
on a set S is a real valued function
$\delta(\cdot,\cdot)$ defined on $S \times S$ ($\times$ indicates
cartesian product) such that for arbitrary
F, G, H in S

(a)     $\delta(F,G) \ge 0$                              (1)

(b) (1) $\delta(F,F) = 0$                                (2)

    (2) If $\delta(F,G) = 0$ then $F = G$                (3)

(c)     $\delta(F,G) = \delta(G,F)$                      (4)

(d)     $\delta(F,G) + \delta(G,H) \ge \delta(F,H)$      (5)


A distance, as used herein, is defined
to be a real valued function $d(\cdot,\cdot)$ on
$S \times S$ such that for arbitrary F, G, H in S
at least metric properties (a) and (b)(1), and
usually (b)(2) and (c), hold. For theoretical
proofs it is in fact often desirable
to require that d be a true metric,
while in practical application such a
restriction is usually not necessary.

Not only are distances between
individual distribution functions of 
interest but since each class could 
conceivably be represented by a set of 
distribution functions the distance 
between sets of distributions is also 
of interest. Definition 1 defines the 
distance between sets of distributions. 

Definition 1 - Let the distance
d(F,G) be defined for all F, G in A,
where A is an arbitrary set of cdf's
of interest. If $A_1$ and $A_2$ are non-empty
subsets of A then the distance
$d(A_1, A_2)$ between the sets $A_1$ and $A_2$
is defined as

$d(A_1, A_2) = \inf_{F \in A_1,\, G \in A_2} d(F,G)$        (6)

Note that Definition 1 applies to 
finite and infinite sets of distribu- 
tion functions. Of course, if the sets 
are finite then taking the infimum is 
equivalent to taking the minimum. 
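
For finite sets, Definition 1 reduces to a minimum over all pairs, which can be sketched as follows. The function names and the toy distance are assumptions made for illustration only: each "distribution" is stood in for by a single real number so that the example stays self-contained.

```python
from itertools import product


def set_distance(d, set_a, set_b):
    """Distance between two non-empty finite sets: the minimum of d over all pairs."""
    if not set_a or not set_b:
        raise ValueError("both sets must be non-empty")
    return min(d(f, g) for f, g in product(set_a, set_b))


def toy_distance(f, g):
    """Hypothetical stand-in for d(F, G); here F and G are plain real numbers."""
    return abs(f - g)


a1 = [0.0, 1.0, 4.0]
a2 = [2.5, 7.0]
nearest = set_distance(toy_distance, a1, a2)
```

When each set holds a single element, `set_distance` returns the distance between those two elements.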

Furthermore, if each set consists
only of a single distribution function 
then the distance between the sets is 
precisely the distance between the 
distribution functions. The distance 
between a distribution function and a 
set of distribution functions is also 
included as a special case. It is 
necessary to make some comments about 
the usage of the notation d(F,G).
Some of the distance measures consid- 
ered are expressed in terms of prob- 
ability density functions (pdf's) 
rather than cumulative distribution 
functions (cdf's). The convention 
adopted is that the notation d(F,G) is 
still used and referred to as the distance
between cdf's, even though the
distance is expressed in terms of the
densities of F and G (i.e. in terms of
f and g).

The minimum distance classifica- 
tion scheme can now be formally defin- 
ed. It is convenient to use a decision 
theoretic framework for this purpose. 

In general to specify a problem in this 
framework it is necessary to specify: 

(a) Z - the sample space of the observed
random variable.

(b) $\Omega$ - the set of states of nature;
that is, the set of possible cdf's of
the random variable. If the functional
form of the cdf is known, then $\Omega$
can be identified with the parameter
space.

(c) A - the action space; that is, the
set of actions or decisions available
to the statistician.

(d) L(a,F) - loss function defined on
$A \times \Omega$ which measures the loss incurred if
$F \in \Omega$ is the true state of nature and
action $a \in A$ is the action taken.

The general formulation of the 
minimum distance problem in this frame- 
work follows : 

(a) $Z = E^q$ (q-dimensional Euclidean
space)

(b) $\Omega = [\Omega^{(1)}, \Omega^{(2)}, \ldots, \Omega^{(k)}]$ where $\Omega^{(i)}$
is the set of possible distribution
functions for the ith class, i = 1, 2,
..., k.

(c) $A = [a_1, a_2, \ldots, a_k]$ where $a_i$ is
the decision to decide the random sample
to be classified belongs to the
ith class, i = 1, 2, ..., k.

(d) L(a,F) = 0 if $F \in \Omega^{(i)}$ and action $a_i$
was taken

L(a,F) = 1 otherwise.

A decision rule is a function de- 
fined on Z and taking values in A. The 
minimum distance decision rule is given 
by definition 2. 


Definition 2 - Let Y be the vector
of all sample observations. The minimum
distance decision rule $D_{MD}: Z \to A$ is
$D_{MD}(Y) = a_i$ (i.e., decide the random
sample to be classified belongs to
class i) in case

$d(\hat{F}_N, A^{(i)}) = \min_j d(\hat{F}_N, A^{(j)})$

where $A^{(i)}$ is the set of cdf's selected
to represent the ith class and $\hat{F}_N$
is a sample-based estimate of the cdf
of the random sample classified.
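
A minimal sketch of the decision rule in Definition 2 follows. It is an illustration under assumptions, not the implementation used for the experiments: the data are univariate, each cdf is estimated by the empirical cdf of its sample, and the Kolmogorov-Smirnov distance is used as the distance d. The names `ks_distance` and `minimum_distance_rule` and the synthetic samples are hypothetical.

```python
import numpy as np


def ks_distance(sample_f, sample_g):
    """Kolmogorov-Smirnov distance between the empirical cdf's of two 1-D samples."""
    f = np.sort(np.asarray(sample_f, dtype=float))
    g = np.sort(np.asarray(sample_g, dtype=float))
    grid = np.concatenate([f, g])  # the supremum is attained at an observation
    cdf_f = np.searchsorted(f, grid, side="right") / f.size
    cdf_g = np.searchsorted(g, grid, side="right") / g.size
    return float(np.max(np.abs(cdf_f - cdf_g)))


def minimum_distance_rule(sample, class_sets, distance=ks_distance):
    """Return the index i whose representative set A^(i) is nearest to the sample.

    class_sets[i] is a list of training samples standing in for A^(i); the
    distance to a set is the minimum distance to any of its members.
    """
    set_distances = [min(distance(sample, rep) for rep in reps) for reps in class_sets]
    return int(np.argmin(set_distances))


# Synthetic example: class 0 has two training samples, class 1 has one.
rng = np.random.default_rng(0)
class_sets = [
    [rng.normal(0.0, 1.0, 200), rng.normal(0.5, 1.0, 200)],
    [rng.normal(3.0, 1.0, 200)],
]
unknown = rng.normal(3.0, 1.0, 150)
predicted = minimum_distance_rule(unknown, class_sets)
```

Any other distance with the same call signature can be passed through the `distance` argument, which mirrors the fact that the formulation does not fix the distance measure.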


Several items in definition 2 re- 
quire clarification. The vector Y in- 
cludes not only the random sample to 
be classified, but also any other ob- 
servations used in the classification 
procedure. For example, if training 
samples are used for each class, these 
are included in Y. The sets $A^{(i)}$ also
require comment. $A^{(i)}$ may be the set
of all possible distributions for
class i (i.e. $A^{(i)} = \Omega^{(i)}$), or it may
be a subset of $\Omega^{(i)}$, or the sample-based
estimates of a set of cdf's selected
to represent class i. Finally the
term sample-based estimate is used to
refer to any estimate of a cumulative
distribution function or its corresponding
density which is based on a
random sample from the distribution in
question. A number of suitable estimators
exist and the present formulation
does not restrict the type of
estimator. Later attention will be 
focused on distance measures based on 
densities. In the parametric case the 
densities will be estimated by esti- 
mating the parameters describing the 
densities (parametrically estimated 
pdf's). In the nonparametric case density
estimates will be based on histograms
(density histogram estimation).
To obtain a density histogram estimate
of a pdf the observation space is
partitioned into square bins and the
probability density estimate in any
bin is the percent of vectors used to
estimate the density which fall in the
bin.
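
The density histogram estimate described above can be sketched with NumPy as follows. The function name, the bin width, and the range limits are assumptions for illustration; the estimate is returned as a fraction per bin (multiply by 100 for a percentage), and vectors outside the stated range are ignored by `numpy.histogramdd`.

```python
import numpy as np


def density_histogram(vectors, bin_width, lower, upper):
    """Fraction of the vectors falling in each square bin of the observation space.

    vectors: array of shape (n_vectors, q). The same bin edges are used on
    every axis, so the bins are square (hypercubes when q > 2).
    """
    vectors = np.asarray(vectors, dtype=float)
    q = vectors.shape[1]
    n_bins = int(round((upper - lower) / bin_width))
    edges = [np.linspace(lower, upper, n_bins + 1) for _ in range(q)]
    counts, _ = np.histogramdd(vectors, bins=edges)
    return counts / vectors.shape[0], edges


# Four 2-dimensional measurement vectors, bins of width 0.5 on [0, 1].
field = np.array([[0.1, 0.1], [0.2, 0.3], [0.9, 0.9], [0.6, 0.1]])
estimate, edges = density_histogram(field, bin_width=0.5, lower=0.0, upper=1.0)
```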


A number of special cases of the 
above formulation are now considered.
These special cases are basically a
consequence of making different assumptions
regarding $\Omega$ and $A = [A^{(1)},
A^{(2)}, \ldots, A^{(k)}]$. In Type I problems
the sets of distribution functions representing
the classes are assumed to
be known sets. Actually, this pro- 
blem is not of great interest from a 
practical point of view, since class 
distributions are not normally known, 
but it is interesting from a theoreti- 
cal point of view because of its rela- 
tive simplicity. 

Type I - The $\Omega^{(i)}$'s are known sets of
cdf's

Case (a) The sets $\Omega^{(i)}$ are infinite and
$A^{(i)} = \Omega^{(i)}$

Case (b) The sets $\Omega^{(i)}$ are finite and
$A^{(i)} = \Omega^{(i)}$

Case (c) The sets $\Omega^{(i)} = F^{(i)}$ (single
cdf/class) and $A^{(i)} = F^{(i)}$

Type II problems differ from Type 
I problems in that the possible dis- 
tribution functions for each class are 
known to be q-variate distributions 
but are otherwise unknown. Consequent- 
ly, all distributions used in the mini- 
mum distance decision rule must be est- 
imated. Since in practice only a 
finite number of estimated distribu- 
tions can be utilized this factor must 
be considered in formulating the problem.
If the sets of states of nature
(e.g. the $\Omega^{(i)}$'s) are infinite the
infinite sets must somehow be replaced
by a representative finite set. A
similar attitude must be adopted if it
is known a priori that the sets $\Omega^{(i)}$
are finite but it is not known precisely
how many distribution functions each
$\Omega^{(i)}$ contains (e.g. how many subclasses
of wheat are there?); or even if the
precise number is known, it may not be
known how to obtain a random sample
for each distribution function (i.e.
how are samples representing different
subclasses of wheat selected?). Finally,
in the finite case, even if a random
sample for each distribution function
of interest can be obtained,
their number may be so large that for
practical reasons it may be desirable
to use a smaller number of representative
distributions. Thus, the need
arises for a method to select a representative
set of distribution functions
from a larger (possibly infinite)
set. To do this assign a distribution
$H^{*(i)}$ to $\Omega^{(i)}$, i = 1, 2, ..., k. That
is, the events to which probability
mass is assigned by $H^{*(i)}$ are sets of
distributions in $\Omega^{(i)}$. To select a
random set of cdf's from $\Omega^{(i)}$ (i.e. to
select a random set of training samples
for the ith class) is now equivalent
to selecting a random sample
from $H^{*(i)}$.

The above formulation is rather 
complicated in that a distribution over 
a space of functions is involved. This 
complexity can be avoided by restrict- 
ing consideration to a parametric fam- 
ily characterized by s real parameters. 
Making the logical assumption that a
one to one correspondence exists between
cdf's in $\Omega^{(i)}$ and points in the
parameter space $\Theta^{(i)}$ ($= E^s$), it is apparent
that assigning a distribution
$H^{*(i)}$ to $\Omega^{(i)}$ is equivalent to assigning
some other distribution $H^{(i)}$ to
the parameter space $\Theta^{(i)}$. Consequently,
in the parametric case rather than
deal with $H^{*(i)}$, which is a cdf on a
set of distribution functions, only $H^{(i)}$,
which is a cdf in $E^s$, need be considered.

It is perhaps worthwhile to re- 
state the above ideas in terms of mul- 
tispectral data-imagery from an agri- 
cultural scene before stating them in 
a more formal manner. In the interest 
of simplicity and since it is the case 
of primary interest assume that the 
true q-dimensional distribution of the
radiance measurements from each field
belongs to the same parametric family
which can be characterized in the parameter
space $E^s$. This family may have
a finite or infinite number of members
(i.e. subclasses). Further assume that
all the fields in a class (e.g. wheat)
can be described by a suitable distribution
$H^{(i)}$ over the parameter space.
A set of training fields for each class
is selected at random. Because of our
formulation this is equivalent to selecting
a random sample from the parameter
space according to the assumed
distribution over the parameter space
for that class (i.e. $H^{(i)}$). For each
of the randomly selected training
fields the radiance measurements are
used to get an estimated cdf for that
field. In this way estimated cdf's
for a representative set of training
fields are obtained for each class.
An unknown field is then assigned to
the class that has a training field
whose estimated cdf is nearest to the
estimated cdf of the unknown field.
Since the problem as stated is parametric,
one would normally, though not
necessarily, use parametrically estimated
cdf's.
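
The per-field procedure just restated can be sketched as follows, assuming each field is multivariate normal so that its distribution is estimated by a sample mean and covariance. The Bhattacharyya distance between two normal distributions (the form listed in Table 2) is used here only as one possible choice of distance; the function names, class labels, and synthetic fields are hypothetical.

```python
import numpy as np


def estimate_parameters(field_vectors):
    """Parametric estimate for one field: sample mean and covariance matrix."""
    x = np.asarray(field_vectors, dtype=float)
    return x.mean(axis=0), np.cov(x, rowvar=False)


def bhattacharyya_normal(mean_f, cov_f, mean_g, cov_g):
    """Bhattacharyya distance between two multivariate normal distributions."""
    cov_avg = 0.5 * (cov_f + cov_g)
    diff = mean_f - mean_g
    term_mean = 0.125 * diff @ np.linalg.solve(cov_avg, diff)
    _, logdet_avg = np.linalg.slogdet(cov_avg)
    _, logdet_f = np.linalg.slogdet(cov_f)
    _, logdet_g = np.linalg.slogdet(cov_g)
    term_cov = 0.5 * (logdet_avg - 0.5 * (logdet_f + logdet_g))
    return float(term_mean + term_cov)


def classify_field(unknown_field, training_fields):
    """Assign the unknown field to the class owning the nearest training field.

    training_fields maps a class label to a list of arrays, one per field.
    """
    mean_u, cov_u = estimate_parameters(unknown_field)
    best_label, best_distance = None, np.inf
    for label, fields in training_fields.items():
        for field in fields:
            mean_t, cov_t = estimate_parameters(field)
            d = bhattacharyya_normal(mean_u, cov_u, mean_t, cov_t)
            if d < best_distance:
                best_label, best_distance = label, d
    return best_label, best_distance


# Synthetic 2-band fields: two "wheat" training fields and one "oats" field.
rng = np.random.default_rng(1)
identity = np.eye(2)
training_fields = {
    "wheat": [rng.multivariate_normal([0.0, 0.0], identity, size=100),
              rng.multivariate_normal([0.5, 0.0], identity, size=100)],
    "oats": [rng.multivariate_normal([4.0, 4.0], identity, size=100)],
}
unknown_field = rng.multivariate_normal([4.0, 4.0], identity, size=100)
label, distance = classify_field(unknown_field, training_fields)
```

Keeping several training fields per class corresponds to representing a class by a set of distributions rather than by a single one.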

Type II problems in which the
$\Omega^{(i)}$'s are unknown are now formally
described. While prime interest is
centered in the case where $\Omega^{(i)}$ is a parametric
family this restriction is not
imposed in stating the problem. The
description of Type II problems is complicated
by the fact that the description
of the sets $A^{(i)}$ is rather involved.

Type II - The $\Omega^{(i)}$'s are Unknown Sets
of cdf's

Case (a) - The sets $\Omega^{(i)}$ are infinite
in number and $A^{(i)} = \hat{\Omega}_{M_i}^{(i)}$. The sets
$\hat{\Omega}_{M_i}^{(i)}$ are now described. First a set
of population cdf's corresponding to a
representative set of $M_i$ training
fields for class i, i = 1, 2, ..., k
is selected. Let $\Omega_{M_i}^{(i)}$ be this set
for the ith class. That is, $\Omega_{M_i}^{(i)}$ is
a random sample of size $M_i$ from $\Omega^{(i)}$.
A sample-based cdf is then obtained
for each cdf in $\Omega_{M_i}^{(i)}$ for i = 1, 2,
..., k. The resultant set of sample-based
estimated cdf's is $\hat{\Omega}_{M_i}^{(i)}$. For
the case where parametrically estimated
cdf's are used $\hat{\Omega}_{M_i}^{(i)}$ can also be
considered to be a random sample of
size $M_i$ in the parameter space according
to a distribution $H^{(i)}$.


Case (b) - The sets $\Omega^{(i)}$ are finite and
$A^{(i)} = \hat{\Omega}^{(i)}$ or $A^{(i)} = \hat{\Omega}_{M_i}^{(i)}$. If
the $\Omega^{(i)}$ are finite sets (i.e. finite
number of subclasses) then it is desirable
to let $A^{(i)} = \hat{\Omega}^{(i)}$, where
$\hat{\Omega}^{(i)}$ is the set of sample-based estimated
cdf's for the ith class. In
cases where the resultant number of
subclasses is impracticably large and/or
only a random set of $M_i$ training
fields is available it is necessary to
let $A^{(i)} = \hat{\Omega}_{M_i}^{(i)} \subset \hat{\Omega}^{(i)}$ and proceed as
in case (a).

Case (c) - The set $\Omega^{(i)} = F^{(i)}$ (single
cdf per class) and $A^{(i)} = \hat{F}_N^{(i)}$.

Distance Measures 

The importance in statistics of 
distances between cdf's has, of course, 
long been recognized; according to 
Samuel and Bachi their use appears
to fall into two broad categories:

(a) Used for descriptive purposes. 

For example, as an indicator to quanti- 
tatively specify how near a given dis- 
tribution is to a normal distribution. 

(b) Used in hypothesis testing, which
is, of course, a special case of de- 
cision theory. 

There is a tendency for distance 
functions sufficiently sensitive to 
detect minor differences in distribu- 
tion functions (i.e. category (a) use) 
to be somewhat involved functions of 
the observations, with the result that 
their use as test statistics in hypoth- 
esis testing has been limited because 
of the complicated distribution theory. 
On the other hand, distance functions 
whose theory is simple enough to be 
readily used as test statistics often 
do not distinguish distribution functions
sufficiently well. Since in
minimum distance classification interest
is naturally centered on good discrimination
between distribution functions,
distance functions
that fall into category (a) are normally
used. Since the appropriate
distribution theory for hypothesis 
testing is then in general not known 
it is impossible to theoretically 
compute probability of error, but it 
may be possible to establish reason- 
ably tight upper bounds. The approxi- 
mate probability of error can of 
course be determined experimentally. 

The literature abounds with refer- 
ences to distance measures and no at- 
tempt will be made to give a complete 
bibliography. A representative sample 
of distance measures is given in Table 
1. This Table includes the most widely 
used distance measures because of their 
obvious importance, as well as more ob- 
scure distance measures whose applica- 
tion to the present problem appears 
reasonable. In addition a few miscel- 
laneous distance measures have been 
included to give an indication of the 
variety of distances that have been 
suggested. The distances included in
this Table are: Cramer-Von Mises [7,8,9,10],
Kolmogorov-Smirnov [11,12,9,10],
Divergence [13,14,15], Bhattacharyya [15,16],
Jeffreys-Matusita [13,14,17], Kolmogorov
Variational [15,18,19], Kullback-Leibler
[15,20], Swain-Fu [21], Mahalanobis [22,23],
Samuels-Bachi, and Kiefer-Wolfowitz.
The references cited are by no means
comprehensive. In selecting the references
the attempt has been made to
cite only the original source in
addition to survey papers. The papers
by Darling [9], Sahler [10] and to a certain
extent Kailath [15] fall in this
latter category.

Most of the references cited are
concerned only with the univariate
forms of the distance measure. With
the exception of the Samuels-Bachi
distance, the extension to the multivariate
forms is quite natural. Since
it is the multivariate forms that are
of interest, these, rather than the
more common univariate forms, are given
in Table 1. For the Samuels-Bachi distance
multivariate forms other than
the one presented may be possible.

Table 1 also contains information
regarding the metric properties of the
distance measures when used in conjunction
with three families of distribution
functions. The families considered
are: C, the family of q-variate
absolutely continuous distribution
functions; MVN, the family of q-variate
normal distribution functions; and
$MVN_\Sigma$, the family of q-variate normal
distribution functions with equal covariance
matrices. Since MVN and $MVN_\Sigma$
are subsets of C it is, of course,
true that a metric in C is also a
metric in MVN and $MVN_\Sigma$. A metric in
$MVN_\Sigma$ need not, however, be a metric in
MVN or C.

Because of the importance of the 
multivariate normal distribution, ex- 
pressions for the distance between two 
such distributions are given in Table 
2 for each of the distances listed
in Table 1 in those instances where the 
expressions are known. 

The distances listed in Table 1 
are discussed in the references cited 
and no attempt will be made to discuss 
them except for some general comments 
pertaining to their use in minimum 
distance classification. 

Since a large variety of distance 
measures is available, the problem naturally
arises as to which distance measure
to use in a given problem. Unfortunately,
no complete answer to this
question is presently available, but 
some general comments are possible. 

The distribution-free properties* that 
make the Cramer-Von Mises and 
Kolmogorov-Smirnov distances so popu- 
lar in the univariate case do not apply 
in the multivariate case. Since it is 
the multivariate case that is of inter- 
est these distances lose their special 
appeal. Intuitively a distance like 
the Kolmogorov-Smirnov distance does 
not appear to be as good a distance 


* In the univariate case the distribu- 
tion of the Kolmogorov-Smirnov and 
the Cramer-Von Mises distances between 
two estimated distribution functions is 
independent of the underlying distri- 
butions being estimated, provided 
appropriate estimators are used. 
Table 1

Multivariate Forms of Distance Measures and
Their Metric Properties

Name | Form | Metric in C | Metric in MVN | Metric in $MVN_\Sigma$

Cramer-Von Mises | $W = \{\int (G(x) - F(x))^2 dx\}^{1/2}$ | Yes | Yes | Yes

Kolmogorov-Smirnov | $K = \sup_x |G(x) - F(x)|$ | Yes | Yes | Yes

Divergence | $J = \int \ln(f(x)/g(x)) (f(x) - g(x)) dx$ | No | No | Yes

Bhattacharyya Distance | $B = -\ln \int (f(x) g(x))^{1/2} dx$ | No | No | Yes

Jeffreys-Matusita Distance | $M = \{\int (\sqrt{g(x)} - \sqrt{f(x)})^2 dx\}^{1/2}$ | Yes | Yes | Yes

Kolmogorov Variational Distance | $K(p) = \int |p_g g(x) - p_f f(x)| dx$ | Yes | Yes | Yes

Kullback-Leibler Numbers | $L_{fg} = \int \ln(f(x)/g(x)) f(x) dx$ | No | No | Yes

Swain-Fu Distance | $T = |\mu_f - \mu_g| / (D_f + D_g)$, where $D_* = \{ |\mu_f - \mu_g|^2 (q+2) / \mathrm{tr}[\Sigma_*^{-1} (\mu_f - \mu_g)(\mu_f - \mu_g)^t] \}^{1/2}$ for $* = f, g$ | No | No | Yes

Mahalanobis Distance | $\Delta = \{(\mu_g - \mu_f)^t \Sigma^{-1} (\mu_g - \mu_f)\}^{1/2}$ | | | Yes

Samuels-Bachi Distance | $U = \{\int_0^1 [F^{-1}(\alpha) - G^{-1}(\alpha)]^2 d\alpha\}^{1/2}$, where $F^{-1}(\alpha) = \inf\{c \mid Q_c \cap Q_\alpha \ne \emptyset\}$, $Q_c = \{x \mid \sum_{i=1}^{q} x_i < c\}$ and $Q_\alpha = \{x \mid F(x) > \alpha\}$ | No | No | No

Kiefer-Wolfowitz Distance | $V = \int |F(x) - G(x)| e^{-|x|} dx$ | Yes | Yes | Yes


Notation

(1) F, G are multivariate cdf's with densities f, g; means $\mu_f$, $\mu_g$; covariances $\Sigma_f$, $\Sigma_g$;
and prior probabilities $p_f$, $p_g$.

(2) $\int ( ) dx$ designates a multivariate integral.

(3) For Mahalanobis distance F and G are normal with means $\mu_f$ and $\mu_g$ and have common
covariance $\Sigma$.

(4) | | designates the absolute value or vector norm.

(5) t designates the transpose.
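Table 2

Distances Between Two Multivariate Normal cdf's

Name | Distance

Divergence | $J = \frac{1}{2} \mathrm{tr}[(\Sigma_f - \Sigma_g)(\Sigma_g^{-1} - \Sigma_f^{-1})] + \frac{1}{2} \mathrm{tr}[(\Sigma_f^{-1} + \Sigma_g^{-1})(\mu_f - \mu_g)(\mu_f - \mu_g)^t]$

Bhattacharyya Distance | $B = \frac{1}{8} (\mu_f - \mu_g)^t [\frac{\Sigma_f + \Sigma_g}{2}]^{-1} (\mu_f - \mu_g) + \frac{1}{2} \ln \frac{\det(\frac{1}{2}[\Sigma_f + \Sigma_g])}{\{\det(\Sigma_f) \det(\Sigma_g)\}^{1/2}}$

Jeffreys-Matusita Distance | $M = [2\{1 - \frac{\{\det(\Sigma_f) \det(\Sigma_g)\}^{1/4}}{\{\det(\frac{1}{2}[\Sigma_f + \Sigma_g])\}^{1/2}} \exp(-\frac{1}{8} (\mu_f - \mu_g)^t [\frac{\Sigma_f + \Sigma_g}{2}]^{-1} (\mu_f - \mu_g))\}]^{1/2}$

Kullback-Leibler Numbers | $L_{fg} = \frac{1}{2} \ln \frac{\det(\Sigma_g)}{\det(\Sigma_f)} + \frac{1}{2} \mathrm{tr}[\Sigma_f (\Sigma_g^{-1} - \Sigma_f^{-1})] + \frac{1}{2} \mathrm{tr}[\Sigma_g^{-1} (\mu_f - \mu_g)(\mu_f - \mu_g)^t]$

Swain-Fu Distance | $T = |\mu_f - \mu_g| / (D_f + D_g)$, where $D_* = \{ |\mu_f - \mu_g|^2 (q+2) / \mathrm{tr}[\Sigma_*^{-1} (\mu_f - \mu_g)(\mu_f - \mu_g)^t] \}^{1/2}$ for $* = f, g$

Mahalanobis Distance | $\Delta = \{(\mu_g - \mu_f)^t \Sigma^{-1} (\mu_g - \mu_f)\}^{1/2}$, ($\Sigma = \Sigma_f = \Sigma_g$)


Notation

(1) t means transpose

(2) det means determinant

(3) tr means trace

(4) The normal distributions involved have means $\mu_f$ and $\mu_g$ and covariance matrices $\Sigma_f$ and $\Sigma_g$.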
measure as those involving integration
over the whole space. It is also more
difficult to compute in parametric
situations than some of the integral
relations. The Samuels-Bachi distance
suffers from a similar computational
disadvantage.

The Divergence, Bhattacharyya distance, Jeffreys-Matusita distance,
Kolmogorov variational distance and Kullback-Leibler numbers all belong
to a class of distance measures which can be written as the expected
value of a convex function of the likelihood ratio*. In fact, Ali and
Silvey have shown that the expected value of any convex function of the
likelihood ratio has properties that might reasonably be demanded of a
distance measure. In addition, Wacker has shown that in feature
selection such distance measures have a weak relationship to the
probability of error. Kailath [15] proved the same relationship for
Divergence and the Bhattacharyya distance. Since the class of distance
measures under discussion is based on pdf's, there is probably a
tendency for these distances to reflect differences in pdf's rather
than cdf's.

Of the distances based on likelihood ratios, the Bhattacharyya distance
seems to have been gaining in favor. The prime reason for this is
apparently the close relation between probability of error and
Bhattacharyya distance, as well as the relative ease of computing
Bhattacharyya distance in theoretical problems. Other properties of the
Bhattacharyya distance which enhance its prestige as a distance measure
have been pointed out by Lainiotis and Stein. A property of
considerable theoretical utility is the close relation between the
Bhattacharyya distance B, the Jeffreys-Matusita distance M and the
affinity $\rho$, namely

$M = [2(1 - \rho)]^{1/2} = [2(1 - e^{-B})]^{1/2}$    (8)

where

$B = -\ln \rho$    (9)

and

$\rho(F,G) = \int (f(x)\,g(x))^{1/2}\,dx$    (10)

* The likelihood ratio of densities f(x) and g(x) is f(x)/g(x).


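Because of the above relationships (M increases monotonically with B,
and B decreases monotonically with the affinity, so all three rank the
candidate classes in the same order), minimum distance classifications
made on the basis of the Bhattacharyya distance, the Jeffreys-Matusita
distance or the affinity all yield identical results, and consequently
have identical probability of error.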
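
The equivalence stated above can be checked numerically. The following is a minimal sketch, assuming discrete (histogram-like) densities so that the affinity integral becomes a sum; the class names and probability values are hypothetical and serve only to illustrate that the three criteria select the same class.

```python
import math

def affinity(p, q):
    """Discrete affinity: sum over cells of sqrt(p_i * q_i)."""
    return sum(math.sqrt(a * b) for a, b in zip(p, q))

def bhattacharyya(p, q):
    """B = -ln(rho); infinite when the densities do not overlap."""
    rho = affinity(p, q)
    return -math.log(rho) if rho > 0.0 else math.inf

def jeffreys_matusita(p, q):
    """M = [2(1 - rho)]^(1/2); max() guards against rounding below zero."""
    return math.sqrt(max(0.0, 2.0 * (1.0 - affinity(p, q))))

# Hypothetical densities over four cells (each sums to 1).
sample = [0.10, 0.40, 0.40, 0.10]
classes = {
    "A": [0.25, 0.25, 0.25, 0.25],
    "B": [0.05, 0.45, 0.40, 0.10],
    "C": [0.70, 0.10, 0.10, 0.10],
}

# Minimum B, minimum M and maximum affinity pick the same class.
by_b = min(classes, key=lambda c: bhattacharyya(sample, classes[c]))
by_m = min(classes, key=lambda c: jeffreys_matusita(sample, classes[c]))
by_rho = max(classes, key=lambda c: affinity(sample, classes[c]))
print(by_b, by_m, by_rho)
```

Because the mapping between the three quantities is strictly monotone, the agreement does not depend on the particular numbers chosen here.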

The Jeffreys-Matusita distance is, 
however, a metric in a much larger 
class of distributions (see Table 1).
This means that theoretical derivations 
regarding probability of error can be 
made using the metric properties of 
the Jeffreys-Matusita distance in this 
larger class, and the results are ap- 
plicable if classification is effected 
using Bhattacharyya distance or affin- 
ity as well. This property has been 
used extensively by Matusita. 

While no strong preference for any distance measure can presently be
demonstrated, the theoretical properties of the Bhattacharyya distance
suggest that it might be a reasonable choice, and the experimental
results presented later are based on this distance measure.
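
For the parametric (Gaussian) case, the Bhattacharyya distance has a closed form. The sketch below uses the single-channel (q = 1) special case of the normal-distribution expression, in which the covariance matrices reduce to variances. The function names, class parameters and sample values are hypothetical; this is an illustration of the minimum distance rule, not a description of how PERFIELD is implemented.

```python
import math

def bhattacharyya_normal_1d(mean_f, var_f, mean_g, var_g):
    """Bhattacharyya distance between two univariate normal densities."""
    mean_term = 0.25 * (mean_f - mean_g) ** 2 / (var_f + var_g)
    cov_term = 0.5 * math.log((var_f + var_g) / (2.0 * math.sqrt(var_f * var_g)))
    return mean_term + cov_term

def classify_sample(values, class_params):
    """Estimate the sample mean and variance, then pick the nearest class."""
    n = len(values)
    mean = sum(values) / n
    var = sum((v - mean) ** 2 for v in values) / (n - 1)
    return min(
        class_params,
        key=lambda name: bhattacharyya_normal_1d(mean, var, *class_params[name]),
    )

# Hypothetical class statistics: name -> (mean, variance).
class_params = {"wheat": (120.0, 25.0), "corn": (90.0, 36.0)}

# Hypothetical measurements from one field in a single channel.
field_values = [118, 121, 125, 117, 122]
print(classify_sample(field_values, class_params))
```

The first term grows with the separation of the means and the second with the mismatch of the variances, so two classes with equal means can still be distinguished if their spreads differ.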

Minimum Distance Classification And 
Probability of Error 

Considerable literature exists on the minimum distance method, with
Matusita and Wolfowitz [36-39] being the chief contributors.
Wolfowitz's work is concerned primarily with estimation, while much of
Matusita's work deals with the decision problem. Contributions have
also been made by Gupta, Cacoullos, Srivastava, and Hoeffding and
Wolfowitz.

In considering minimum distance decision rules, a common requirement is
to insist that by using arbitrarily large samples the probability of
misclassifying a sample can be made arbitrarily small. This is the
notion of consistency, and it is a reasonable demand if the pairwise
distance between all the sets of distributions associated with each
class is greater than zero, or

$d(\Omega^{(i)}, \Omega^{(j)}) > 0$  for all i, j = 1, 2, ..., k; $i \neq j$    (11)

In parametric problems in which some distribution is assigned to the
parameter space, the condition specified by equation 11 is equivalent
to requiring that there is no overlap of regions of the parameter space
associated with different classes.

It has been shown that any minimum distance classification problem for
which equation 11 holds is consistent (the probability of
misclassification approaches zero as sample sizes approach infinity),
provided the distance and distribution estimator utilized satisfy
certain conditions. These conditions are that the distance used must be
essentially a metric (metric property b(2) need not hold) and that, for
the particular distance measure and estimator used, the distance
between the true and estimated distributions can be made arbitrarily
small with probability one as the sample size approaches infinity.
Further, it is shown that certain distances and estimators satisfy
these conditions. In particular, in the normal case these conditions
are satisfied by using parametrically estimated densities and the
Bhattacharyya distance [35]. Similar consistency results are not known
for density histogram estimators. The known properties of consistency
are summarized more rigorously and in greater detail by Wacker.
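
The behavior that consistency describes can be illustrated with a small Monte Carlo experiment. The sketch below assumes two hypothetical, well-separated univariate normal classes, parametric estimation of the sample mean and variance, and the univariate normal Bhattacharyya distance. It demonstrates the trend only and is not a proof of the cited results.

```python
import math
import random

def bhattacharyya_1d(m1, v1, m2, v2):
    """Bhattacharyya distance between two univariate normal densities."""
    return (0.25 * (m1 - m2) ** 2 / (v1 + v2)
            + 0.5 * math.log((v1 + v2) / (2.0 * math.sqrt(v1 * v2))))

# Hypothetical classes: name -> (mean, variance).
CLASSES = {"A": (0.0, 1.0), "B": (1.0, 1.0)}

def classify(sample):
    """Minimum distance rule using the estimated mean and variance."""
    n = len(sample)
    mean = sum(sample) / n
    var = max(sum((x - mean) ** 2 for x in sample) / n, 1e-6)  # avoid log(0)
    return min(CLASSES, key=lambda c: bhattacharyya_1d(mean, var, *CLASSES[c]))

def error_rate(n, trials=400, seed=0):
    """Fraction of size-n samples drawn from class A that are misclassified."""
    rng = random.Random(seed)
    errors = 0
    for _ in range(trials):
        sample = [rng.gauss(0.0, 1.0) for _ in range(n)]
        errors += classify(sample) != "A"
    return errors / trials

for n in (2, 10, 50, 250):
    print(n, error_rate(n))
```

With distinct class distributions the printed error rate falls toward zero as the sample size grows; if the two classes were given identical parameters, no sample size would help.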

It is the property of consistency described in the previous paragraphs
which makes the minimum distance decision rule potentially so
attractive. In essence, consistency says that if the condition
specified by equation 11 is satisfied, and if sufficiently large
samples are used, then the probability of misclassifying a sample
should be very small. Unfortunately, in classifying multispectral image
data two problems arise.


(1) The number of distributions asso- 
ciated with any class is very large 
(perhaps almost infinite) and it is 
not practical to attempt to store all 
possible subclass distributions as is 
essentially assumed in deriving the 
consistency result described. 

(2) It appears that the condition of 
equation 11 is frequently not satisfied, 
or at least that distributions from 
different classes are often so nearly 
alike that the number of samples re- 
quired to distinguish them is impract- 
ically large. 

When the condition specified by equation 11 is violated to the extent
that $\Omega^{(i)}$ and $\Omega^{(j)}$ overlap on a set of nonzero
probability, the minimum distance decision rule can obviously no longer
be consistent; in this situation the probability of misclassifying a
sample remains greater than zero regardless of sample size. Under these
circumstances, except for the simple parametric example treated by
Wacker, essentially no results are available.


RESULTS 

Three different classifiers were used to obtain the experimental
results. These classifiers are known as LARSYSAA, PERFIELD and
LARSYSDC. LARSYSAA is a vector-by-vector classifier based on the
maximum likelihood decision rule, while PERFIELD and LARSYSDC are
minimum distance classifiers utilizing the Jeffreys-Matusita or
equivalent (Bhattacharyya) distance. LARSYSAA and PERFIELD are based on
the Gaussian assumption and utilize parametrically estimated pdf's,
while LARSYSDC utilizes density histograms to estimate the pdf's. All
three classifiers assume equal subclass probabilities and operate in
the supervised mode*.

* Supervised refers to the fact that samples whose classifications are
known are available to "train" the classifier.

Two examples are discussed. The first example compares the sample
classification accuracy (% samples correct) of a parametric with that
of a nonparametric minimum distance classifier. The second example
compares the vector classification accuracy (% vectors correct) of the
parametric maximum likelihood classifier LARSYSAA with the sample
classification accuracy of the parametric minimum distance classifier
PERFIELD. The data used in both examples are essentially the same but,
as subsequently described, the training procedures differ considerably.

The two examples discussed are 
problems in species identification of 
agricultural fields. In this context 
it is usually logical to assume that 
all the measurement vectors from a 
given physical field belong to the 
same class. This assumption was made 
in defining samples for the minimum 
distance classifiers and in determining 
the classification accuracy of the max- 
imum likelihood classifier. In other 
words, for the minimum distance class- 
ifiers each sample to be classified 
represents a physical field, while for 
the maximum likelihood classifier all 
vectors from a field are assumed to 
belong to the same class. 
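
The two accuracy measures being compared can be made concrete with a short sketch. The record layout and the labels below are hypothetical: each field carries its true class, the labels a per-vector classifier assigned to its measurement vectors, and the single label a sample classifier assigned to the field as a whole.

```python
def vector_accuracy(fields):
    """Percent of individual measurement vectors labeled correctly."""
    total = sum(len(f["vector_labels"]) for f in fields)
    correct = sum(
        label == f["truth"] for f in fields for label in f["vector_labels"]
    )
    return 100.0 * correct / total

def sample_accuracy(fields):
    """Percent of fields (samples) whose single label is correct."""
    correct = sum(f["sample_label"] == f["truth"] for f in fields)
    return 100.0 * correct / len(fields)

# Hypothetical results for three fields.
fields = [
    {"truth": "wheat",
     "vector_labels": ["wheat", "wheat", "corn", "wheat"],
     "sample_label": "wheat"},
    {"truth": "corn",
     "vector_labels": ["corn", "soybeans", "soybeans", "corn", "corn"],
     "sample_label": "corn"},
    {"truth": "soybeans",
     "vector_labels": ["corn", "corn", "soybeans"],
     "sample_label": "corn"},
]

print(round(vector_accuracy(fields), 1), round(sample_accuracy(fields), 1))
```

Note that the vector accuracy weights large fields more heavily than small ones, whereas the sample accuracy counts every field once; this is one reason the two figures are not strictly comparable.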

The data for the examples to be discussed have 13 spectral bands and
were collected by the University of Michigan scanner. For ease in
referring to different spectral bands, the wavelength channel number
correspondence of Table 3 is utilized. The data were collected at an
altitude of 3000 ft., between 9:45 and 10:45 a.m. E.D.T., on June 30,
1970, from Purdue University flightlines 21, 23 and 24. The exact
location and orientation of these flightlines, which are located in
Tippecanoe County, Indiana, is shown in Fig. 3. The flightlines extend
the 24 mile length from the north to the south end of the county and
are roughly equally spaced in the east-west direction. Since the
scanner geometry is such that at an altitude of 3000 feet the field of
view is roughly 1 mile, the area covered by the three flightlines,
approximately 72 square miles, is about 1/7 of the total area in the
county. The scanner resolution and sampling rate are nominally three
and six milliradians respectively. This means that at nadir the scanner
"sees" a circle about 9 feet in diameter and that the spacing between
adjacent pixels is about 18 feet. Since the scanner resolution and
sampling rate are independent of look angle, the distance between
adjacent pixels is approximately 30% larger at the edge of the
scanner's field of view, with a corresponding change in the shape and
area "seen" by the scanner. At the sampling rate indicated there are
220 samples across the width of a flightline and each flightline
contains 5000 to 6000 lines. This means each flightline contains
somewhat more than 10^6 pixels, of which 10% to 20% are typically used
for test purposes.
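
The figures quoted above follow from simple arithmetic, sketched below. The small-angle approximation (ground distance = altitude x angle in radians) is assumed, and the constant names are chosen here for illustration only.

```python
ALTITUDE_FT = 3000
RESOLUTION_MRAD = 3          # instantaneous field of view
SAMPLING_MRAD = 6            # angular spacing between samples
SAMPLES_PER_LINE = 220
LINES_PER_FLIGHTLINE = (5000, 6000)
FLIGHTLINE_LENGTH_MI = 24
SWATH_WIDTH_MI = 1           # "roughly 1 mile" field of view
N_FLIGHTLINES = 3

# Small-angle approximation at nadir.
footprint_ft = ALTITUDE_FT * RESOLUTION_MRAD / 1000
spacing_ft = ALTITUDE_FT * SAMPLING_MRAD / 1000

# Pixels per flightline for the shortest and longest flightlines.
pixels_per_flightline = tuple(SAMPLES_PER_LINE * n for n in LINES_PER_FLIGHTLINE)

# Ground area covered by the three flightlines.
area_sq_mi = N_FLIGHTLINES * FLIGHTLINE_LENGTH_MI * SWATH_WIDTH_MI

print(footprint_ft, spacing_ft, pixels_per_flightline, area_sq_mi)
```

Because the sample spacing is twice the footprint diameter, the ground is not fully covered along a scan line at nadir.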

For both examples four principal ground cover categories are
considered: wheat, corn, soybeans and other. Although the other class
includes a considerable variety of ground cover, most of the
agricultural fields in this category are either small grains (other
than wheat) or forage crops. There are also some bare soils and
diverted-acre fields. Some natural categories such as trees and water
are also included in this class. For most of the subcategories of the
class other, ground cover is fairly complete, but the spectral
properties of the ground cover are quite variable from field to field
within a subcategory. Most of the wheat in the flightline was mature
and ready for harvest. In fact, some portion of it had already been
harvested. For corn and soybeans the crop canopy at flight time was
such that the ground was not covered by vegetation when viewed from
above, and consequently the radiance is greatly influenced by the soil
type. This fact makes it difficult to discriminate corn and soybeans at
this time of year, and consequently high classification accuracies are
not to be expected, especially since corn and soybeans constitute a
considerable fraction of the ground cover.

Table 3

Correspondence Between Channel Numbers
and Spectral Bands

Channel Number    Spectral Band (Micrometers)
      1           0.40-0.44
      2           0.46-0.48
      3           0.50-0.52
      4           0.52-0.55
      5           0.55-0.58
      6           0.58-0.62
      7           0.62-0.66
      8           0.66-0.72
      9           0.72-0.80
     10           0.80-1.00
     11           1.00-1.40
     12           1.50-1.80
     13           2.00-2.60

While the particular training procedure used in each example is
different, some general observations are possible. It is evident that
some of the variables which affect radiance tend to be constant within
a physical field, but vary from field to field. Such variables are
usually related to farm management practices and include such factors
as variety of species, fertilization rates, crop rotation practices,
etc. Also, the variability in soil type can normally be expected to be
greater between fields than within fields. Consequently it is not
uncommon for all data from one field to be fairly "uniform" but still
be quite different from the data from another field, even though the
class (species) is the same in both fields. In terms of probability
densities, the density from each individual field might reasonably be
approximated by a normal distribution, in that it is typically unimodal
and reasonably symmetrical, but the data from several fields combined
frequently exhibit severe multimodality.

Under these circumstances, in order that the Gaussian assumption is
approximately satisfied (for classifiers making this assumption),
subclasses are usually defined for each main class, such that the
distribution for each subclass is unimodal. Perhaps if data from a
sufficient variety of fields could be combined for a given crop
species, a unimodal distribution would result for each main class and
the definition of subclasses would not be necessary, even for a
parametric classifier. The class distribution in this case would
naturally be broader than the distribution of any "subclass" of which
it is composed. It is presently not known in the above situation
whether better classification is achieved with parametric (Gaussian)
classifiers by using many subclasses whose distributions are relatively
narrow, or by using fewer subclasses with broader distributions. In
practice there appears to be a tendency toward the definition of many
subclasses. In nonparametric classifiers it should of course not be
necessary to define subclasses, as there is no need for densities to be
unimodal.
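
The broadening that results from pooling fields can be seen in a toy calculation. The sketch below uses two hypothetical fields of the same crop measured in a single channel, each internally uniform but with different mean radiance; the numbers are invented for illustration.

```python
from statistics import mean, pvariance

# Hypothetical single-channel data values from two fields of one class.
field_a = [48, 50, 52, 50, 49, 51]
field_b = [78, 80, 82, 80, 79, 81]
pooled = field_a + field_b

within_a = pvariance(field_a)
within_b = pvariance(field_b)
pooled_var = pvariance(pooled)

# For equal-sized fields, pooled variance = mean within-field variance
# plus the variance of the field means (the between-field part).
between = pvariance([mean(field_a), mean(field_b)])

print(within_a, within_b, pooled_var, between)
```

Here the pooled variance is dominated by the between-field term, and a single Gaussian fitted to the pooled data would place most of its mass between the two modes, where no observations lie.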

On the basis of the above discussion, a fairly general parametric model
which at least qualitatively behaves much like the actual multispectral
data results when every field associated with each main class is
considered as a potential subclass. The variation in distribution
parameters from field to field is accounted for by a distribution over
the parameter space. This is precisely the problem previously
formulated as Type II case (a).

Example 1 - Parametric vs Nonparametric 

The classifications performed for this example can be segregated into
the four categories shown below.

1) Classifications with the parametric classifier PERFIELD

a) Every training field treated as a subclass.

b) Data from all training fields for each principal class combined (no
subclasses).

2) Classifications with the nonparametric classifier LARSYSDC

a) Every training field treated as a subclass.

b) Data from all training fields for each principal class combined (no
subclasses).

In the classification procedure each flightline was treated as a
separate data set. The training and classification method is described
for one flightline, with other flightlines receiving similar treatment.
Initially test and training data must be defined. Every field of any
significant size whose classification had been determined by field
observation was included as a possible test or training field. These
fields were segregated into the four principal classes. Roughly 10% of
the fields in each class were then selected at random to serve as
training fields. The remaining fields were used as test fields. Table 4
gives a breakdown of the number of test and training fields for each
flightline. After the training fields had been selected, the subclass
or class densities were estimated and stored. The test fields were then
classified on the basis of their estimated densities by the minimum
distance rule. The computations to estimate a density function for
PERFIELD are substantially simpler than for LARSYSDC, since for
PERFIELD only the mean and covariance need be estimated, while for
LARSYSDC the density histogram must be generated. A bin size of 5 was
used for the density histograms in LARSYSDC. (The data range was 0 to
256.) Only 3 of the 13 channels were used in performing the
classifications. These were selected in a more or less arbitrary
manner, although it was known that the selected set (1, 8, 11) was
among the better subsets of channels.
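
The nonparametric procedure described above can be sketched as follows. This is a minimal illustration, assuming a multichannel density histogram with cubic cells of side 5 and the discrete form of the Bhattacharyya distance. The function names and data values are hypothetical, and the sketch is not a description of the LARSYSDC program itself.

```python
import math
from collections import Counter

BIN_SIZE = 5  # bin width quoted in the text

def density_histogram(vectors, bin_size=BIN_SIZE):
    """Map each measurement vector to a histogram cell; return cell -> mass."""
    counts = Counter(tuple(int(v) // bin_size for v in vec) for vec in vectors)
    n = len(vectors)
    return {cell: count / n for cell, count in counts.items()}

def bhattacharyya_hist(p, q):
    """B = -ln(sum over shared cells of sqrt(p * q)); inf if no overlap."""
    rho = sum(math.sqrt(mass * q[cell]) for cell, mass in p.items() if cell in q)
    return -math.log(rho) if rho > 0.0 else math.inf

def classify_field(test_vectors, training):
    """Assign the field to the training density nearest to its histogram."""
    test_hist = density_histogram(test_vectors)
    return min(training, key=lambda name: bhattacharyya_hist(test_hist, training[name]))

# Hypothetical three-channel training data for two classes.
training = {
    "wheat": density_histogram(
        [(100, 60, 30), (102, 61, 33), (104, 63, 31), (101, 62, 34), (96, 58, 29)]
    ),
    "corn": density_histogram(
        [(60, 90, 120), (62, 91, 121), (63, 93, 124), (61, 92, 122)]
    ),
}

# Hypothetical test field (one sample = all vectors from the field).
test_field = [(101, 61, 32), (103, 62, 33), (100, 60, 31)]
print(classify_field(test_field, training))
```

With three channels and few vectors per field most cells are empty, so two histograms can fail to overlap at all and the distance becomes infinite. The parametric estimate, which needs only a mean and a covariance, does not have this difficulty.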

Table 4

Number of Test and Training Fields

Number of Test (Training) Fields

Flightline   Total     Wheat   Corn    Soybeans  Other
    21       218(22)   23(2)   79(8)   57(6)     59(6)
    23       141(15)   18(2)   58(6)   55(6)     10(1)
    24       156(18)   19(2)   52(6)   43(5)     42(5)


The results of the classification 
are shown in Fig. 4. Rather than pre- 
sent the classification results for 
each flightline individually the per- 
formance averaged over the three 
flightlines is given. The results 
therefore give some indication of the 
classification accuracy one might ex- 
pect on the average for this type of 
data for the training method used. In 
view of the random nature of the train- 
ing procedure it is felt that this is 
a more meaningful presentation than 
quoting the results for each flight- 
line individually. 

Example 2 - Maximum Likelihood vs Minimum Distance Classification

For this example the data from flightlines 21, 23, and 24 was
classified using:

a) The parametric maximum likelihood 
classifier LARSYSAA. 

b) The parametric minimum distance 
classifier PERFIELD. 

The training procedure in this case is considerably different from the
procedure for Example 1. In this case small areas approximately one
acre in size were selected from flightlines 21, 23, and 24 on the basis
of a sampling scheme. The sampling scheme simply used every nth acre in
the flightline belonging to the class in question as a "training acre".
The data from the acres selected in this manner was used to train the
classifier. In this manner 59 wheat acres, 44 corn acres, 23 soybean
acres and 46 other acres were selected. The sampling rate n was
different for the various principal classes. If every training acre
were treated as a separate subclass, a total of 172 subclasses would
result. This number exceeds the capabilities of the classification
programs. Consequently it was necessary to reduce the number of
subclasses to a reasonable number. This was accomplished by means of a
clustering program which groups together the acres within each
principal class whose estimated pdf's are similar. As a result of this
grouping the numbers of subclasses defined for the principal classes
wheat, corn, soybeans and other were 4, 10, 6 and 10 respectively.
Density histogram estimates of the resulting 4 wheat subclasses are
shown in Fig. 5. Note that even after clustering considerable evidence
of multimodality still exists, particularly for the first subclass. In
fact, in some channels the contributions of all 4 acres assigned to
subclass 1 are clearly evident. It is possible that this data should
have been segregated into a greater number of subclasses. After the
subclasses had been defined by clustering, the statistics (means and
covariances) were computed for each subclass. The feature selection
capability of LARSYSAA was then used to select the "best" 4 of the 13
channels for classification. This selection is based on the average
divergence between all possible subclass pairs, excluding subclass
pairs from the same class. On this basis channels 2, 8, 11, and 12 were
selected. Using these channels both the training acres and the test
fields were classified with both LARSYSAA and PERFIELD. The
classification results for the training acres are shown in Fig. 6,
while the results for the test fields (again averaged over the 3
flightlines) are shown in Fig. 7.
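
The channel selection criterion described above can be sketched as an exhaustive search over channel subsets. To stay self-contained, the sketch assumes diagonal covariance matrices, so that the divergence between two normal densities is the sum of single-channel divergences; the actual feature selection uses full covariances. All names and statistics below are hypothetical, and channel indices are zero-based.

```python
from itertools import combinations

def divergence_1d(mean_f, var_f, mean_g, var_g):
    """Divergence between two univariate normal densities."""
    cov_term = 0.5 * (var_f - var_g) * (1.0 / var_g - 1.0 / var_f)
    mean_term = 0.5 * (1.0 / var_f + 1.0 / var_g) * (mean_f - mean_g) ** 2
    return cov_term + mean_term

def average_divergence(subclasses, channels):
    """Mean divergence over subclass pairs drawn from different classes."""
    total, n_pairs = 0.0, 0
    for (c1, m1, v1), (c2, m2, v2) in combinations(subclasses, 2):
        if c1 == c2:
            continue  # pairs within the same class are excluded
        total += sum(divergence_1d(m1[ch], v1[ch], m2[ch], v2[ch]) for ch in channels)
        n_pairs += 1
    return total / n_pairs

def best_channels(subclasses, n_channels, k):
    """Channel subset of size k with the largest average divergence."""
    return max(
        combinations(range(n_channels), k),
        key=lambda chs: average_divergence(subclasses, chs),
    )

# Hypothetical subclass statistics: (class, channel means, channel variances).
subclasses = [
    ("wheat", [50, 80, 30, 60], [4, 4, 4, 4]),
    ("wheat", [52, 95, 30, 61], [4, 4, 4, 4]),
    ("corn",  [51, 82, 45, 40], [4, 4, 4, 4]),
    ("corn",  [50, 81, 47, 42], [4, 4, 4, 4]),
]

print(best_channels(subclasses, n_channels=4, k=2))
```

Excluding same-class pairs keeps channels that merely separate subclasses of one crop from being favored, since such separation does not reduce classification error between principal classes.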

Discussion of Experimental Results 

It is suggested that in evaluating 
a classifier a reasonable index of com- 
parison is the overall average classi- 
fication accuracy. This performance 
index has the advantage that it gives 
an indication of the classification 
accuracy that might be expected from 
the classifier for similar data and 
training procedures. For a relatively 
small data set, it is usually rela- 
tively easy to devise a training pro- 
cedure or classifier which superfici- 
ally looks superior but whose apparent 
superiority disappears when results 
are averaged over a number of data 
sets. A disadvantage of the suggested performance index is the
necessity to do a reasonable number of classifications.


On the basis of average classification accuracy and the training
procedures used, there is no evidence that the nonparametric minimum
distance classifier is superior to the parametric classifier. Neither
is there any evidence that using a relatively large number of
subclasses improves classification accuracy on the average. This is
contrary to expectations.

Actually, when each field is treated as a subclass one would expect the
nonparametric classifier to perform better than the parametric
classifier only if the Gaussian assumption was seriously violated for
the various training or test fields involved. Furthermore, for the
nonparametric classifier to exhibit any real advantage the nonnormal
structure of the data must bear some resemblance from field to field
(e.g. modes must appear in the same relative positions). Since the
nonparametric classifier does not exhibit any superior performance,
neither of the above factors apparently occurs with any consistency.

When the data from all the training fields is grouped one would expect
that the data would be multimodal and that the nonparametric classifier
would be much superior. The basic fallacy in this reasoning appears to
be that although the class distributions are multimodal, the samples to
be classified are usually unimodal. In other words, any sample to be
classified is not really a random sample from the distribution of any
class. Instead it simply tends to account for one of the modes in the
class distribution. Furthermore, there is no apparent way of rectifying
this situation within the constraint of minimum distance
classification.

The fact that the parametric classifier does so well (comparatively)
when no subclasses are considered attests to the robustness* of the
Gaussian assumption in minimum distance classification.

* A robust classifier is relatively insensitive to the underlying
assumptions about the distributions involved.

It must be recognized that in assessing a classifier, factors other
than the performance index considered are of importance. One other
factor that should be considered is the consistency of the results (in
the sense of repeatability, not the statistical consistency discussed
earlier), that is, how near to the average one can expect to get for
any given classification. The variance in the average performance is a
measure of this consistency. In this regard, although the number of
classifications is small, there is evidence that the nonparametric
classifier is better than the parametric version and that for the
parametric classifier the variance in average performance is increased
by combining the data from many fields. This small advantage hardly
warrants the additional complexity of the nonparametric implementation.

The results comparing the minimum distance and maximum likelihood
classifiers show fairly conclusively that in general the sample
classification accuracy of minimum distance classifiers is higher than
the vector classification accuracy of maximum likelihood classifiers on
the same data. This is true for both the test and training data. It is
recognized of course that the quantities being compared are by nature
somewhat different, but nevertheless they represent the natural method
of expressing the classification accuracy of each classifier
individually and do afford some measure of comparison. This result
agrees with expectations, although a greater improvement might have
been anticipated.

It is convenient to define the difference between the sample
classification accuracy and the vector classification accuracy as the
improvement factor. The exact value of the improvement factor depends
on the particular data, but qualitatively it is obvious that for Type
II case (a) problems the improvement will be very small or nonexistent
both when the separation of the parameter space densities for all
classes is large (one can't improve a high vector classification
accuracy much) and when no separation exists (subclasses of different
main classes can then not be distinguished by either classifier). The
experimental evidence suggests that for moderate overlap of the
parameter space densities the improvement factor will be of the order
of 5% to 10%.

In concluding it should be mentioned that no comparative computation
times have been given. The fact that the experiments involved a number
of different programs and two computer systems (one in a time sharing
mode), together with the inherent dependence of processing time on the
classification parameters and on the manner in which the data is stored
(data retrieval time is by no means negligible), makes it virtually
impossible to give meaningful comparative times. Suffice it to say that
the time to classify a typical flightline would be measured in
fractions of an hour to hours on an IBM 360 System Model 44, and that
PERFIELD is the fastest classifier, followed by LARSYSDC and LARSYSAA
in that order.


CLOSURE 

Although only two examples have been presented, numerous other
classifications have been performed on similar data and the results
generally support the results presented. Even considering only the
classifications discussed, the volume of data involved is quite
substantial and is certainly adequate for a reasonable test.

For the type of data considered two basic conclusions appear
reasonable.

(1) The classification accuracy of a nonparametric minimum distance
classifier, utilizing density histograms for estimating pdf's, is on
the average not any larger than the classification accuracy of the
parametric (Gaussian) classifier based on parametrically estimated
pdf's. The variability in performance of the nonparametric classifier
appears somewhat smaller. Since the parametric classifier requires less
storage and is faster than the nonparametric classifier, the latter
classifier is not an attractive alternative.

(2) The average sample classification 
accuracy of a parametric (Gaussian) 
minimum distance classifier is larger 
than the average vector classification 
accuracy of a maximum likelihood vector
classifier. Ignoring the problem of 
sample definition the minimum distance 
classifier is faster and is an attrac- 
tive alternative to the maximum like- 
lihood classifier in situations where 
it can be utilized. 

The disparity between test and 
training results for both minimum dis- 
tance and maximum likelihood classi- 
fiers is much greater than the differ- 
ence due to classifier type or the 
specific implementation. This suggests 
that given the present state of the art 
greater improvement in classification 
accuracies will probably result from 
investigations intended to improve the 
training procedure than from investi- 
gation of classifier types. 
Fig. 3 Location of Tippecanoe County Flightlines 21, 23 and 24

Fig. 4 Comparison of Average Test Performance for Parametric and
Nonparametric Minimum Distance Classification Using Bhattacharyya
Distance and Random Training

Fig. 5 Histograms for Wheat Subclasses Obtained as Result of Clustering Wheat Acres

Fig. 6 Comparison of the Training Performance for Minimum Distance
and Maximum Likelihood Classification

Fig. 7 Comparison of Average Test Performance of Minimum